Kwanza Tukule Data Analysis¶

This report showcases the completion of the Kwanza Tukule Data Analyst Assessment, designed to evaluate my technical, analytical, and problem-solving skills. Using the provided anonymized sales dataset, I applied data cleaning, preparation, and exploratory analysis techniques to uncover actionable insights. The analysis includes identifying trends, customer segmentation, forecasting, and anomaly detection to address key business challenges. Strategic recommendations are derived from the findings, focusing on product performance, customer retention, and operational efficiency. Additionally, I developed a dashboard to visualize key metrics, enabling better decision-making. This report demonstrates my ability to analyze data effectively and present meaningful insights in a structured, professional manner.

Importing Libraries¶

In [169]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
from matplotlib import pyplot as plt
%matplotlib inline


#html export
import plotly.io as pio
pio.renderers.default = 'notebook'

Loading Dataset¶

In [170]:
df = pd.read_excel(r'Case Study Data - Read Only.xlsx')
df
Out[170]:
DATE ANONYMIZED CATEGORY ANONYMIZED PRODUCT ANONYMIZED BUSINESS ANONYMIZED LOCATION QUANTITY UNIT PRICE
0 2024-08-18 21:32:00 Category-106 Product-21f4 Business-de42 Location-1ba8 1 850.0
1 2024-08-18 21:32:00 Category-120 Product-4156 Business-de42 Location-1ba8 2 1910.0
2 2024-08-18 21:32:00 Category-121 Product-49bd Business-de42 Location-1ba8 1 3670.0
3 2024-08-18 21:32:00 Category-76 Product-61dd Business-de42 Location-1ba8 1 2605.0
4 2024-08-18 21:32:00 Category-119 Product-66e0 Business-de42 Location-1ba8 5 1480.0
... ... ... ... ... ... ... ...
333400 2024-01-09 20:49:00 Category-119 Product-e98d Business-f9ff Location-1979 1 1770.0
333401 2024-08-20 20:24:00 Category-106 Product-21f4 Business-72bc Location-689f 1 850.0
333402 2024-08-20 20:24:00 Category-76 Product-6e9c Business-72bc Location-689f 1 2575.0
333403 2024-10-10 18:19:00 Category-91 Product-523e Business-2693 Location-b27b 1 1520.0
333404 2024-10-10 18:19:00 Category-75 Product-b31e Business-2693 Location-b27b 1 4420.0

333405 rows × 7 columns

Attributes¶

  1. DATE: Represents the date on which the transaction or activity occurred.

  2. ANONYMIZED CATEGORY: Categorization of the product or service involved in the transaction, with the actual category names replaced by anonymized labels.

  3. ANONYMIZED PRODUCT: Refers to the specific product associated with the transaction, with product names anonymized for confidentiality.

  4. ANONYMIZED BUSINESS: Represents the business entity involved in the transaction.

  5. ANONYMIZED LOCATION: The location where the transaction or activity occurred.

  6. QUANTITY: The quantity of the product involved in the transaction, recorded as an integer.

  7. UNIT PRICE: The price per unit of the product.

Data Cleaning and Preparation¶

In [171]:
# make column names and values uniform
df.columns = df.columns.str.lower().str.replace(' ', '_')

categorical_columns = df.dtypes[df.dtypes == 'object'].index
for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')
In [172]:
df
Out[172]:
date anonymized_category anonymized_product anonymized_business anonymized_location quantity unit_price
0 2024-08-18 21:32:00 category-106 product-21f4 business-de42 location-1ba8 1 850.0
1 2024-08-18 21:32:00 category-120 product-4156 business-de42 location-1ba8 2 1910.0
2 2024-08-18 21:32:00 category-121 product-49bd business-de42 location-1ba8 1 3670.0
3 2024-08-18 21:32:00 category-76 product-61dd business-de42 location-1ba8 1 2605.0
4 2024-08-18 21:32:00 category-119 product-66e0 business-de42 location-1ba8 5 1480.0
... ... ... ... ... ... ... ...
333400 2024-01-09 20:49:00 category-119 product-e98d business-f9ff location-1979 1 1770.0
333401 2024-08-20 20:24:00 category-106 product-21f4 business-72bc location-689f 1 850.0
333402 2024-08-20 20:24:00 category-76 product-6e9c business-72bc location-689f 1 2575.0
333403 2024-10-10 18:19:00 category-91 product-523e business-2693 location-b27b 1 1520.0
333404 2024-10-10 18:19:00 category-75 product-b31e business-2693 location-b27b 1 4420.0

333405 rows × 7 columns

In [173]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333405 entries, 0 to 333404
Data columns (total 7 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   date                 333405 non-null  datetime64[ns]
 1   anonymized_category  333405 non-null  object        
 2   anonymized_product   333405 non-null  object        
 3   anonymized_business  333405 non-null  object        
 4   anonymized_location  333405 non-null  object        
 5   quantity             333405 non-null  int64         
 6   unit_price           333397 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(4)
memory usage: 17.8+ MB
In [174]:
# duplicates
df.duplicated().sum()
Out[174]:
3524

There are 3524 duplicates. We need to drop them

In [175]:
#dropping duplicates
df = df.drop_duplicates()
In [176]:
df.isna().sum()
Out[176]:
date                   0
anonymized_category    0
anonymized_product     0
anonymized_business    0
anonymized_location    0
quantity               0
unit_price             8
dtype: int64

The column 'unit_price' has 8 null values

In [177]:
df[df.unit_price.isna()]
Out[177]:
date anonymized_category anonymized_product anonymized_business anonymized_location quantity unit_price
108112 2024-07-03 17:53:00 category-94 product-3d7f business-4fce location-f37d 2 NaN
150961 2024-12-16 18:33:00 category-79 product-dfc8 business-8bbf location-3fc0 1 NaN
151142 2024-12-22 14:42:00 category-122 product-15e0 business-c575 location-1979 3 NaN
272379 2024-06-27 12:15:00 category-92 product-ccbc business-14b6 location-1979 1 NaN
278284 2024-08-14 21:09:00 category-101 product-84a5 business-4be1 location-bb69 21 NaN
278384 2024-12-30 14:17:00 category-95 product-15f3 business-1a74 location-f37d 1 NaN
310385 2024-03-31 14:03:00 category-114 product-9204 business-c9dc location-689f 1 NaN
327152 2024-08-13 16:20:00 category-107 product-7eed business-0d61 location-1ba8 1 NaN

Lets try to see whether its the anonymized_product that determines the unit price so that we fill the nulls

In [178]:
df_product_3d7fdf = df[df['anonymized_product'] == 'product-3d7f']
df_product_3d7fdf
Out[178]:
date anonymized_category anonymized_product anonymized_business anonymized_location quantity unit_price
196 2024-09-25 13:05:00 category-94 product-3d7f business-cdc1 location-e2f8 1 860.0
287 2024-11-13 19:36:00 category-94 product-3d7f business-7f77 location-8959 1 860.0
319 2024-11-01 16:54:00 category-94 product-3d7f business-2b91 location-3fc0 1 860.0
372 2024-07-11 11:20:00 category-94 product-3d7f business-2b24 location-66f4 5 875.0
447 2024-12-09 19:35:00 category-94 product-3d7f business-1b52 location-7f37 1 885.0
... ... ... ... ... ... ... ...
332771 2024-10-15 20:01:00 category-94 product-3d7f business-0ea3 location-4ea1 1 860.0
332802 2024-10-21 19:04:00 category-94 product-3d7f business-5415 location-128a 1 860.0
332842 2024-09-20 13:43:00 category-94 product-3d7f business-5760 location-689f 1 860.0
333002 2024-10-08 17:15:00 category-94 product-3d7f business-8603 location-b27b 1 860.0
333188 2024-11-08 21:20:00 category-94 product-3d7f business-8bbf location-3fc0 1 860.0

3164 rows × 7 columns

In [179]:
df_product_3d7fdf['unit_price'].unique()
Out[179]:
array([860., 875., 885., 870.,  nan])

For product 'product-3d7f' we can see there's 3 different unit prices: 860.0, 875.0, 885.0, 870.0

In [180]:
df_product_3d7fdf[df_product_3d7fdf['unit_price'] == 860.0]
Out[180]:
date anonymized_category anonymized_product anonymized_business anonymized_location quantity unit_price
196 2024-09-25 13:05:00 category-94 product-3d7f business-cdc1 location-e2f8 1 860.0
287 2024-11-13 19:36:00 category-94 product-3d7f business-7f77 location-8959 1 860.0
319 2024-11-01 16:54:00 category-94 product-3d7f business-2b91 location-3fc0 1 860.0
565 2024-09-15 15:52:00 category-94 product-3d7f business-f749 location-3fc0 2 860.0
780 2024-10-03 15:08:00 category-94 product-3d7f business-cdc1 location-e2f8 1 860.0
... ... ... ... ... ... ... ...
332771 2024-10-15 20:01:00 category-94 product-3d7f business-0ea3 location-4ea1 1 860.0
332802 2024-10-21 19:04:00 category-94 product-3d7f business-5415 location-128a 1 860.0
332842 2024-09-20 13:43:00 category-94 product-3d7f business-5760 location-689f 1 860.0
333002 2024-10-08 17:15:00 category-94 product-3d7f business-8603 location-b27b 1 860.0
333188 2024-11-08 21:20:00 category-94 product-3d7f business-8bbf location-3fc0 1 860.0

2344 rows × 7 columns

Lets try to see whether the anonymized category is responsible for the price 860.0

In [181]:
df_product_3d7fdf[df_product_3d7fdf['anonymized_category'] == 'category-94']
Out[181]:
date anonymized_category anonymized_product anonymized_business anonymized_location quantity unit_price
196 2024-09-25 13:05:00 category-94 product-3d7f business-cdc1 location-e2f8 1 860.0
287 2024-11-13 19:36:00 category-94 product-3d7f business-7f77 location-8959 1 860.0
319 2024-11-01 16:54:00 category-94 product-3d7f business-2b91 location-3fc0 1 860.0
372 2024-07-11 11:20:00 category-94 product-3d7f business-2b24 location-66f4 5 875.0
447 2024-12-09 19:35:00 category-94 product-3d7f business-1b52 location-7f37 1 885.0
... ... ... ... ... ... ... ...
332771 2024-10-15 20:01:00 category-94 product-3d7f business-0ea3 location-4ea1 1 860.0
332802 2024-10-21 19:04:00 category-94 product-3d7f business-5415 location-128a 1 860.0
332842 2024-09-20 13:43:00 category-94 product-3d7f business-5760 location-689f 1 860.0
333002 2024-10-08 17:15:00 category-94 product-3d7f business-8603 location-b27b 1 860.0
333188 2024-11-08 21:20:00 category-94 product-3d7f business-8bbf location-3fc0 1 860.0

3164 rows × 7 columns

Even for product 'product-3d7f' with category 'category-94', we still have no supporting data to associate the unit price to.

Hence, its safe to drop the nulls

In [182]:
# dropping nulls
df = df.dropna()
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 329873 entries, 0 to 333404
Data columns (total 7 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   date                 329873 non-null  datetime64[ns]
 1   anonymized_category  329873 non-null  object        
 2   anonymized_product   329873 non-null  object        
 3   anonymized_business  329873 non-null  object        
 4   anonymized_location  329873 non-null  object        
 5   quantity             329873 non-null  int64         
 6   unit_price           329873 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(4)
memory usage: 20.1+ MB

Summary of Issues Identified and Steps Taken to Resolve Them:¶

  1. Duplicate Records:
  • Issue Identified: The dataset initially contained 3524 duplicate records, which could skew the analysis and affect the accuracy of any insights derived from the data.
  • Steps Taken: The duplicates were identified and dropped from the dataset using the .duplicated().sum() method and .drop_duplicates() to ensure each record is unique and accurately represents individual sales transactions.
  1. Null Values in 'unit_price' Column:
  • Issue Identified: There were 8 null values in the unit_price column, which is critical as the price of the product must be available for proper calculations of sales value and other analyses.
  • Steps Taken: The null values in the unit_price column were identified using .isnull().sum() and subsequently dropped from the dataset using .dropna() to remove any incomplete records that might lead to errors in subsequent analysis.

Final Dataset: After addressing these issues, the dataset now contains 329,873 non-null records, with no duplicate entries and all null values removed from the unit_price column. The dataset's columns now have consistent data types and no missing or redundant data, making it ready for further analysis.

Feature Engineering:
Creating the column: “Month-Year” (e.g., August 2024) from the “DATE” column.

In [183]:
df
Out[183]:
date anonymized_category anonymized_product anonymized_business anonymized_location quantity unit_price
0 2024-08-18 21:32:00 category-106 product-21f4 business-de42 location-1ba8 1 850.0
1 2024-08-18 21:32:00 category-120 product-4156 business-de42 location-1ba8 2 1910.0
2 2024-08-18 21:32:00 category-121 product-49bd business-de42 location-1ba8 1 3670.0
3 2024-08-18 21:32:00 category-76 product-61dd business-de42 location-1ba8 1 2605.0
4 2024-08-18 21:32:00 category-119 product-66e0 business-de42 location-1ba8 5 1480.0
... ... ... ... ... ... ... ...
333398 2024-11-13 13:59:00 category-121 product-898d business-0e99 location-689f 5 1360.0
333401 2024-08-20 20:24:00 category-106 product-21f4 business-72bc location-689f 1 850.0
333402 2024-08-20 20:24:00 category-76 product-6e9c business-72bc location-689f 1 2575.0
333403 2024-10-10 18:19:00 category-91 product-523e business-2693 location-b27b 1 1520.0
333404 2024-10-10 18:19:00 category-75 product-b31e business-2693 location-b27b 1 4420.0

329873 rows × 7 columns

In [184]:
# since column 'date' is already a datetime object,

# ------------Create the 'Month-Year' column---------------#
df2 = df.copy() # creating a copy for original cleaned df

# Create a 'Month-Year' column, keeping the date as the first of the month
df2['month-year'] = df['date'].dt.to_period('M').dt.to_timestamp()

# Format it for display, but still keep the datetime
df2['month-year'] = df2['month-year'].dt.strftime('%B %Y')

# Drop the 'date' column
df2.drop(columns=['date'], inplace=True)

# Move 'Month-Year' column to the first position
columns = ['month-year'] + [col for col in df2.columns if col != 'month-year']
df2 = df2[columns]

# Display the updated DataFrame to check the new column
df2
Out[184]:
month-year anonymized_category anonymized_product anonymized_business anonymized_location quantity unit_price
0 August 2024 category-106 product-21f4 business-de42 location-1ba8 1 850.0
1 August 2024 category-120 product-4156 business-de42 location-1ba8 2 1910.0
2 August 2024 category-121 product-49bd business-de42 location-1ba8 1 3670.0
3 August 2024 category-76 product-61dd business-de42 location-1ba8 1 2605.0
4 August 2024 category-119 product-66e0 business-de42 location-1ba8 5 1480.0
... ... ... ... ... ... ... ...
333398 November 2024 category-121 product-898d business-0e99 location-689f 5 1360.0
333401 August 2024 category-106 product-21f4 business-72bc location-689f 1 850.0
333402 August 2024 category-76 product-6e9c business-72bc location-689f 1 2575.0
333403 October 2024 category-91 product-523e business-2693 location-b27b 1 1520.0
333404 October 2024 category-75 product-b31e business-2693 location-b27b 1 4420.0

329873 rows × 7 columns

EDA (Exploratory Data Analysis)¶

Sales Overview¶

In [185]:
# Calculate the 'Value' column (quantity × unit_price)
df3 = df2.copy()
df3['Value'] = df3['quantity'] * df3['unit_price']
df3
Out[185]:
month-year anonymized_category anonymized_product anonymized_business anonymized_location quantity unit_price Value
0 August 2024 category-106 product-21f4 business-de42 location-1ba8 1 850.0 850.0
1 August 2024 category-120 product-4156 business-de42 location-1ba8 2 1910.0 3820.0
2 August 2024 category-121 product-49bd business-de42 location-1ba8 1 3670.0 3670.0
3 August 2024 category-76 product-61dd business-de42 location-1ba8 1 2605.0 2605.0
4 August 2024 category-119 product-66e0 business-de42 location-1ba8 5 1480.0 7400.0
... ... ... ... ... ... ... ... ...
333398 November 2024 category-121 product-898d business-0e99 location-689f 5 1360.0 6800.0
333401 August 2024 category-106 product-21f4 business-72bc location-689f 1 850.0 850.0
333402 August 2024 category-76 product-6e9c business-72bc location-689f 1 2575.0 2575.0
333403 October 2024 category-91 product-523e business-2693 location-b27b 1 1520.0 1520.0
333404 October 2024 category-75 product-b31e business-2693 location-b27b 1 4420.0 4420.0

329873 rows × 8 columns

Calculating the sales for each unique category

In [186]:
# Group by anonymized_category
category_sales = df3.groupby('anonymized_category').agg(
    Total_Quantity=('quantity', 'sum'),
    Total_Value=('Value', 'sum')
).reset_index()

category_sales
Out[186]:
anonymized_category Total_Quantity Total_Value
0 category-100 76824 134902751.0
1 category-101 19564 35577822.0
2 category-102 1786 464463.0
3 category-104 1217 1557598.0
4 category-105 1579 2690719.0
5 category-106 6521 5932763.0
6 category-107 2729 4170797.0
7 category-108 9756 5101375.0
8 category-109 1446 1263226.0
9 category-110 10529 5483386.0
10 category-111 6715 4387343.0
11 category-113 741 1254083.0
12 category-114 3 8600.0
13 category-115 348 425360.0
14 category-116 856 422745.0
15 category-117 5 1550.0
16 category-118 21 7560.0
17 category-119 68332 103454819.0
18 category-120 169715 319178743.0
19 category-121 14669 22327643.0
20 category-122 1223 3493480.0
21 category-123 286 730730.0
22 category-124 4 10060.0
23 category-125 123 297060.0
24 category-74 941 1927871.0
25 category-75 151330 544658700.0
26 category-76 71719 344939553.0
27 category-77 28455 76741382.0
28 category-78 9766 9792609.0
29 category-79 2215 1184953.0
30 category-81 142 72061.0
31 category-82 4759 3930818.0
32 category-83 2436 4039483.0
33 category-84 11933 6798158.0
34 category-85 22997 33762533.0
35 category-86 8 3320.0
36 category-89 238 136850.0
37 category-90 15 15750.0
38 category-91 20853 44152103.0
39 category-92 6953 10468723.0
40 category-94 23668 16750815.0
41 category-95 4116 7466932.0
42 category-96 1427 2249424.0
43 category-97 2711 2628309.0
44 category-98 2152 2519695.0
45 category-99 1964 1589480.0
In [187]:
# graphing total sales and total value per anonymized category
import plotly.graph_objects as go

# bar chart for Total Quantity
bar = go.Bar(
    x=category_sales["anonymized_category"],
    y=category_sales["Total_Quantity"],
    name="Total Quantity",
    marker=dict(color="steelblue")
)

# line chart for Total Value
line = go.Scatter(
    x=category_sales["anonymized_category"],
    y=category_sales["Total_Value"],
    name="Total Value",
    mode="lines+markers",  # Line with markers
    line=dict(color="firebrick", width=3),  # Line color and width
    marker=dict(size=8),  # Marker size
    yaxis="y2"  # Assign to secondary y-axis
)

# Combine both traces
fig = go.Figure(data=[bar, line])

# Update layout for dual y-axes
fig.update_layout(
    title="Sales Overview by Category",
    xaxis_title="Category",
    yaxis=dict(
        title="Total Quantity",
        titlefont=dict(color="steelblue"),  # Match bar chart color
        tickfont=dict(color="steelblue")  # Match bar chart color
    ),
    yaxis2=dict(
        title="Total Value",
        titlefont=dict(color="firebrick"),  # Match line chart color
        tickfont=dict(color="firebrick"),  # Match line chart color
        overlaying="y",  # Overlay on the same plot
        side="right"  # Position on the right
    ),
    legend_title="Metrics",
    barmode="group",  # Group bars together
    template="plotly_white"  # Cleaner theme
)

# Show the figure
fig.show()

∴ From the combined Line and Bar graph plot, we can deduce 'category-120' has the highest amount of sales with over 169.7k total quantities sold.
However, when it comes to the total value gained, 'category-75' has the highest sales with over 544.6M in revenue

Calculating the sales for each unique business.

In [188]:
# Group by anonymized_business
business_sales = df3.groupby('anonymized_business').agg(
    Total_Quantity=('quantity', 'sum'),
    Total_Value=('Value', 'sum')
).reset_index()

business_sales
Out[188]:
anonymized_business Total_Quantity Total_Value
0 business-0000 8 10445.0
1 business-0005 1 2645.0
2 business-0029 26 77340.0
3 business-003d 98 221761.0
4 business-0072 127 225056.0
... ... ... ...
4795 business-ffa9 3 6740.0
4796 business-ffae 6 10530.0
4797 business-ffb1 266 438115.0
4798 business-ffd2 37 67723.0
4799 business-ffff 110 110285.0

4800 rows × 3 columns

For the sales per Business comparison, 2 charts is appropriate due to the large number of values plotted

In [189]:
# Line chart for Total Value (green)
bar_value = go.Line(
    x=business_sales["anonymized_business"],
    y=business_sales["Total_Value"],
    name="Total Value",
    marker=dict(color="green"),
)

# layout
layout_value = go.Layout(
    title="Total Value by Business",
    xaxis_title="Business",
    yaxis_title="Total Value",
    template="plotly_white",
)

# figure
fig_value = go.Figure(data=[bar_value], layout=layout_value)

# Show the plot
fig_value.show()
c:\Program Files\Python312\Lib\site-packages\plotly\graph_objs\_deprecations.py:378: DeprecationWarning:

plotly.graph_objs.Line is deprecated.
Please replace it with one of the following more specific types
  - plotly.graph_objs.scatter.Line
  - plotly.graph_objs.layout.shape.Line
  - etc.


∴ From the Line plot, we can deduce 'business-978e' has the highest amount of sales with over 28M revenue.

In [190]:
# line chart for Total Quantity (blue)
bar_quantity = go.Line(
    x=business_sales["anonymized_business"],
    y=business_sales["Total_Quantity"],
    name="Total Quantity",
    marker=dict(color="blue"),
)

# layout
layout_quantity = go.Layout(
    title="Total Quantity by Business",
    xaxis_title="Business",
    yaxis_title="Total Quantity",
    template="plotly_white",
)

# figure
fig_quantity = go.Figure(data=[bar_quantity], layout=layout_quantity)

# Show the plot
fig_quantity.show()

∴ From the Line plot, we can deduce 'business-978e' has the highest amount of sales with over 13.9k total quantities sold.

Trends Over Time¶

To create a time series plot we need a datetime variable

In [191]:
# Create a copy of df
df4 = df.copy()

# Add 'Value' column
df4['value'] = df4['quantity'] * df4['unit_price']

# Create a new column 'Month-Year' in df4 (not df)
df4['month-year'] = df4['date'].dt.to_period('M').dt.to_timestamp()

# Format 'Month-Year' to display as "August 2024"
df4['month-year-formatted'] = df4['month-year'].dt.strftime('%B %Y')

# Reorder columns to make 'Month-Year-Formatted' the first column
df4 = df4[['month-year-formatted'] + [col for col in df4.columns if col != 'month-year-formatted']]

del df4['date']
In [192]:
df4.info()
<class 'pandas.core.frame.DataFrame'>
Index: 329873 entries, 0 to 333404
Data columns (total 9 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   month-year-formatted  329873 non-null  object        
 1   anonymized_category   329873 non-null  object        
 2   anonymized_product    329873 non-null  object        
 3   anonymized_business   329873 non-null  object        
 4   anonymized_location   329873 non-null  object        
 5   quantity              329873 non-null  int64         
 6   unit_price            329873 non-null  float64       
 7   value                 329873 non-null  float64       
 8   month-year            329873 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(2), int64(1), object(5)
memory usage: 25.2+ MB

Our datetime object(month-year) is set to the first of the month for aggregation, for instance, any date in January is recorded as Jan 1, 2024

In [193]:
# Group the data by 'Month-Year' and calculate the monthly average of 'Quantity' and 'Value'
monthly_avg_sales = df4.groupby('month-year').agg({'quantity': 'mean', 'value': 'mean'}).reset_index()

# Convert 'Month-Year' to a string for plotting purposes
monthly_avg_sales['month-year'] = monthly_avg_sales['month-year'].astype(str)

# Group the data by 'Month-Year' and calculate the monthly averages
monthly_avg_sales = df4.groupby('month-year').agg({'quantity': 'mean', 'value': 'mean'}).reset_index()

# Convert 'Month-Year' to a string for plotting
monthly_avg_sales['month-year'] = monthly_avg_sales['month-year'].astype(str)

# Create a line plot with dual y-axes
fig = go.Figure()

# Add the line for Value
fig.add_trace(go.Scatter(
    x=monthly_avg_sales['month-year'], 
    y=monthly_avg_sales['value'], 
    name='Average Value', 
    line=dict(color='red'),
    yaxis='y1'
))

# Add the line for Quantity
fig.add_trace(go.Scatter(
    x=monthly_avg_sales['month-year'], 
    y=monthly_avg_sales['quantity'], 
    name='Average Quantity', 
    line=dict(color='green'),
    yaxis='y2'
))

# Update layout for dual y-axes
fig.update_layout(
    title="Monthly Averages of Sales (Quantity and Value)",
    xaxis=dict(title="Month-Year"),
    yaxis=dict(
        title="Average Value",
        titlefont=dict(color="blue"),
        tickfont=dict(color="blue"),
    ),
    yaxis2=dict(
        title="Average Quantity",
        titlefont=dict(color="green"),
        tickfont=dict(color="green"),
        anchor="x",
        overlaying="y",
        side="right",
    ),
    legend=dict(x=0.5, y=1.1, orientation="h"),
)

# Show the plot
fig.show()

Monthly averages are useful for understanding the typical performance per month, especially when there are outliers or variations in the data

For the monthly average sales(Both quantity and value), there is a downward slope,indicating a decreasing trend in the sales over time.

In [194]:
# Group the data by 'Month-Year' and calculate the monthly sum of 'Quantity' and 'Value'
monthly_sum_sales = df4.groupby('month-year').agg({'quantity': 'sum', 'value': 'sum'}).reset_index()

# Convert 'Month-Year' to a string for plotting purposes
monthly_sum_sales['month-year'] = monthly_sum_sales['month-year'].astype(str)

# Group the data by 'Month-Year' and calculate the monthly averages
monthly_sum_sales = df4.groupby('month-year').agg({'quantity': 'sum', 'value': 'sum'}).reset_index()

# Convert 'Month-Year' to a string for plotting
monthly_sum_sales['month-year'] = monthly_sum_sales['month-year'].astype(str)

# Create a line plot with dual y-axes
fig = go.Figure()

# Add the line for Value
fig.add_trace(go.Scatter(
    x=monthly_sum_sales['month-year'], 
    y=monthly_sum_sales['value'], 
    name='Total Sum Value', 
    line=dict(color='red'),
    yaxis='y1'
))

# Add the line for Quantity
fig.add_trace(go.Scatter(
    x=monthly_sum_sales['month-year'], 
    y=monthly_sum_sales['quantity'], 
    name='Total Sum Quantity', 
    line=dict(color='green'),
    yaxis='y2'
))

# Update layout for dual y-axes
fig.update_layout(
    title="Monthly Sum of Sales (Quantity and Value)",
    xaxis=dict(title="Month-Year"),
    yaxis=dict(
        title="Total Value",
        titlefont=dict(color="blue"),
        tickfont=dict(color="blue"),
    ),
    yaxis2=dict(
        title="Total Quantity",
        titlefont=dict(color="green"),
        tickfont=dict(color="green"),
        anchor="x",
        overlaying="y",
        side="right",
    ),
    legend=dict(x=0.5, y=1.1, orientation="h"),
)

# Show the plot
fig.show()

For understanding overall monthly performance, aggregate the data by summing the sales for each month.

The trend has visible fluctuations, with peaks in certain months(May, July and October), it might suggest seasonal effects.

Performance analysis¶

In [195]:
# top 5 most frequently purchased products (based on Quantity).

# Group by product and sum the quantities
top_products = df4.groupby('anonymized_product')['quantity'].sum().reset_index()

# Sort in descending order and select the top 5
top_5_products = top_products.sort_values(by='quantity', ascending=False).head(5)

# Display the result
top_5_products
Out[195]:
anonymized_product quantity
338 product-66e0 46957
750 product-e805 42602
476 product-8f75 37566
127 product-29ee 35940
213 product-4156 28487
In [196]:
# Create a bar graph
fig = px.bar(
    top_5_products,
    x='anonymized_product',
    y='quantity',
    title='Top 5 Most Purchased Products (Quantity)',
    labels={'anonymized_product': 'Product', 'quantity': 'Total Quantity'},
    text='quantity'  # Display the quantity on the bars
)

# Customize the layout
fig.update_layout(
    xaxis_title='Product',
    yaxis_title='Total Quantity',
    template='plotly_white'
)

# Show the plot
fig.show()
In [197]:
# top 5 most valuable products (based on Value).

# Group by product and sum the quantities
top_value_products = df4.groupby('anonymized_product')['value'].sum().reset_index()

# Sort in descending order and select the top 5
top_value_products = top_value_products.sort_values(by='value', ascending=False).head(5)

# Display the result
top_value_products
Out[197]:
anonymized_product value
750 product-e805 262787281.0
476 product-8f75 158797460.0
338 product-66e0 70704225.0
127 product-29ee 68248274.0
213 product-4156 56956007.0
In [198]:
# Create a bar graph
fig = px.bar(
    top_value_products,
    x='anonymized_product',
    y='value',
    title='Top 5 Most Valuable Products (Value)',
    labels={'anonymized_product': 'Product', 'value': 'Total Quantity'},
    text='value'  # Display the quantity on the bars
)

# Customize the layout
fig.update_layout(
    xaxis_title='Product',
    yaxis_title='Total Value',
    template='plotly_white'
)

# Show the plot
fig.show()

Advanced Analysis¶

Perform a segmentation analysis of businesses (Anonymized Business) based on their purchasing behavior:¶

In [199]:
df4
Out[199]:
month-year-formatted anonymized_category anonymized_product anonymized_business anonymized_location quantity unit_price value month-year
0 August 2024 category-106 product-21f4 business-de42 location-1ba8 1 850.0 850.0 2024-08-01
1 August 2024 category-120 product-4156 business-de42 location-1ba8 2 1910.0 3820.0 2024-08-01
2 August 2024 category-121 product-49bd business-de42 location-1ba8 1 3670.0 3670.0 2024-08-01
3 August 2024 category-76 product-61dd business-de42 location-1ba8 1 2605.0 2605.0 2024-08-01
4 August 2024 category-119 product-66e0 business-de42 location-1ba8 5 1480.0 7400.0 2024-08-01
... ... ... ... ... ... ... ... ... ...
333398 November 2024 category-121 product-898d business-0e99 location-689f 5 1360.0 6800.0 2024-11-01
333401 August 2024 category-106 product-21f4 business-72bc location-689f 1 850.0 850.0 2024-08-01
333402 August 2024 category-76 product-6e9c business-72bc location-689f 1 2575.0 2575.0 2024-08-01
333403 October 2024 category-91 product-523e business-2693 location-b27b 1 1520.0 1520.0 2024-10-01
333404 October 2024 category-75 product-b31e business-2693 location-b27b 1 4420.0 4420.0 2024-10-01

329873 rows × 9 columns

In [200]:
# Group by Anonymized Business and calculate required metrics
business_segmentation = df4.groupby('anonymized_business').agg(
    total_quantity=('quantity', 'sum'),
    total_value=('value', 'sum'),
    transaction_frequency=('anonymized_business', 'count')
).reset_index()

# Sorting for analysis (optional)
top_businesses = business_segmentation.sort_values(by='total_value', ascending=False)
top_businesses
Out[200]:
anonymized_business total_quantity total_value transaction_frequency
2906 business-978e 13991 28037358.0 2614
4781 business-fe7d 6743 26997121.0 653
1873 business-6068 8214 16464195.0 1474
144 business-07de 6065 16258068.0 1160
2353 business-7a03 6318 13968451.0 1147
... ... ... ... ...
2616 business-8654 1 55.0 1
3666 business-c19b 1 55.0 1
1478 business-4b87 5 10.0 1
4060 business-d78d 1 2.0 1
4619 business-f712 22 0.0 1

4800 rows × 4 columns

total_quantity: Calculates the total quantity purchased by summing the quantity column for each business.
total_value: Sums the value column to calculate the total revenue contributed by each business.
transaction_frequency: Uses the count of rows for each business to determine the frequency of transactions.

In [201]:
# scatter plot to analyze relationships between total quantity, total value, and transaction frequency
fig = px.scatter(
    business_segmentation,
    x='total_quantity',
    y='total_value',
    size='transaction_frequency',
    color='transaction_frequency',
    title='Customer Segmentation Based on Purchasing Behavior',
    labels={
        'total_quantity': 'Total Quantity Purchased',
        'total_value': 'Total Value Contributed ($)',
        'transaction_frequency': 'Transaction Frequency'
    },
    hover_data=['anonymized_business']
)

# Show the plot
fig.show()

From our scatter plot we can deduce, High-value businesses which are Businesses with high total value and frequent transactions.

High value businesses would be businesses at the top right quadrant of our graph having more quantities purchased and high value contributed.

Our Top High-value business would be business-978e with the highest total quantity purchased, highest value contributed and highest transaction frequency

Classifying businesses into 3 groups (e.g., High Value, Medium Value, Low Value)¶

In [202]:
business_segmentation
Out[202]:
anonymized_business total_quantity total_value transaction_frequency
0 business-0000 8 10445.0 8
1 business-0005 1 2645.0 1
2 business-0029 26 77340.0 6
3 business-003d 98 221761.0 31
4 business-0072 127 225056.0 101
... ... ... ... ...
4795 business-ffa9 3 6740.0 3
4796 business-ffae 6 10530.0 5
4797 business-ffb1 266 438115.0 105
4798 business-ffd2 37 67723.0 22
4799 business-ffff 110 110285.0 107

4800 rows × 4 columns

In [203]:
# trying to see the distribution of total_value to calssify the businesses
fig = px.box(business_segmentation, y='total_value')

# Show the figure
fig.show()
In [204]:
# Calculate quartiles
Q1 = business_segmentation['total_value'].quantile(0.25)
Q2 = business_segmentation['total_value'].quantile(0.50)
Q3 = business_segmentation['total_value'].quantile(0.75)

# Compute IQR and upper whisker (excluding extreme outliers)
IQR = Q3 - Q1
upper_whisker = Q3 + 1.5 * IQR

# Define function to categorize values into quartiles
def categorize(value):
    if value <= Q1:
        return "0 - Q1"
    elif Q1 < value <= Q2:
        return "Q1 - Q2"
    elif Q2 < value <= Q3:
        return "Q2 - Q3"
    elif Q3 < value <= upper_whisker:
        return "Q3 - Q4"
    else:
        return "Outlier"

# Apply function to categorize values
business_segmentation['quartile_category'] = business_segmentation['total_value'].apply(categorize)

# Ensure quartile categories are ordered
quartile_order = ["0 - Q1", "Q1 - Q2", "Q2 - Q3", "Q3 - Q4", "Outlier"]

# Convert to categorical with ordered=True
business_segmentation['quartile_category'] = pd.Categorical(
    business_segmentation['quartile_category'],
    categories=quartile_order,
    ordered=True
)

# Count occurrences, now sorted correctly
quartile_counts = business_segmentation['quartile_category'].value_counts().sort_index()

# Display results
print("Count of values in each quartile:\n")
print(quartile_counts)
Count of values in each quartile:

quartile_category
0 - Q1     1200
Q1 - Q2    1200
Q2 - Q3    1200
Q3 - Q4     526
Outlier     674
Name: count, dtype: int64
In [205]:
# Get the minimum and maximum values
min_value = business_segmentation['total_value'].min()
max_value = business_segmentation['total_value'].max()

# Display results
print(f"Minimum Value: {min_value}")
print(f"First Quartile (Q1): {Q1}")
print(f"Second Quartile (Median/Q2): {Q2}")
print(f"Third Quartile (Q3): {Q3}")
print(f"Fourth Quartile (Upper Whisker): {upper_whisker}")
print(f"Maximum Value: {max_value}")
Minimum Value: 0.0
First Quartile (Q1): 9207.5
Second Quartile (Median/Q2): 41869.0
Third Quartile (Q3): 234042.0
Fourth Quartile (Upper Whisker): 571293.75
Maximum Value: 28037358.0

Based on the distribution of

In [206]:
# Define function to classify businesses based on their total_value
def classify_business(value):
    if value > Q3:
        return "high"
    elif Q2 < value <= Q3:
        return "medium"
    else:
        return "low"

# Apply the classification function to the dataframe
business_segmentation['value_category'] = business_segmentation['total_value'].apply(classify_business)

business_segmentation
Out[206]:
anonymized_business total_quantity total_value transaction_frequency quartile_category value_category
0 business-0000 8 10445.0 8 Q1 - Q2 low
1 business-0005 1 2645.0 1 0 - Q1 low
2 business-0029 26 77340.0 6 Q2 - Q3 medium
3 business-003d 98 221761.0 31 Q2 - Q3 medium
4 business-0072 127 225056.0 101 Q2 - Q3 medium
... ... ... ... ... ... ...
4795 business-ffa9 3 6740.0 3 0 - Q1 low
4796 business-ffae 6 10530.0 5 Q1 - Q2 low
4797 business-ffb1 266 438115.0 105 Q3 - Q4 high
4798 business-ffd2 37 67723.0 22 Q2 - Q3 medium
4799 business-ffff 110 110285.0 107 Q2 - Q3 medium

4800 rows × 6 columns

Business Value Segmentation & Engagement Strategies¶

1. High-Value Businesses (Q3 - Q4 & Outliers) 💰
Businesses that contribute the highest total value.

📌 Recommendations:

  • Loyalty Programs: Offer discounts, rebates, or exclusive deals for bulk purchases.
  • Early Access: Provide early access to new products or priority stock allocation.
  • Co-Marketing Initiatives: Partner for promotions, sponsorships, or cross-selling opportunities.

2. Medium-Value Businesses (Q2 - Q3) ⚖️ Businesses with moderate purchases and potential for growth.

📌 Recommendations:

  • Upselling & Cross-Selling: Introduce relevant product bundles or complementary items.
  • Flexible Payment Terms: Offer customized payment plans to increase order volume.
  • Marketing Support: Provide digital assets, promotional materials, or co-branded advertisements.
  • Incentivized Growth Plans: Introduce tiered rewards—higher purchases unlock better benefits.

3. Low-Value Businesses (0 - Q2) 📉 Businesses with minimal engagement or small purchases.

📌 Recommendations:

  • Special Promotions: Offer introductory discounts or limited-time deals to encourage higher orders.
  • Automated Follow-Ups: Use email campaigns to remind them of product benefits or new stock.
  • Survey & Feedback: Understand why their purchases are low and identify pain points.
  • Self-Service Support: Provide a digital knowledge base or chatbot assistance to answer queries.
  • Referral Incentives: Encourage referrals with discounts or bonuses for bringing in new customers.

Forecasting¶

In [210]:
# forecasting with arima
import plotly.graph_objects as go
from statsmodels.tsa.arima.model import ARIMA

# Ensure the Month-Year column is in datetime format
df4['month-year'] = pd.to_datetime(df4['month-year'])

# Aggregate total sales by month
monthly_sales = df4.groupby('month-year')['value'].sum()

# Fit an ARIMA model (order can be optimized based on ACF/PACF)
model = ARIMA(monthly_sales, order=(2,1,2))  
model_fit = model.fit()

# Forecast for the next 3 months
forecast = model_fit.forecast(steps=3)

# Generate future date range
future_dates = pd.date_range(start=monthly_sales.index[-1], periods=4, freq='M')[1:]

# Convert dates to "Month-Year" format (e.g., "January 2024")
formatted_dates = future_dates.strftime('%B %Y')

# Create Plotly figure
fig = go.Figure()

# Add historical sales data
fig.add_trace(go.Scatter(
    x=monthly_sales.index.strftime('%B %Y'),  # Format historical x-axis labels
    y=monthly_sales.values,
    mode='lines',
    name='Historical Sales',
    line=dict(color='blue')
))

# Add forecasted sales data
fig.add_trace(go.Scatter(
    x=formatted_dates, 
    y=forecast.values,
    mode='lines+markers',
    name='Forecasted Sales',
    line=dict(color='red', dash='dash')
))

# Update layout
fig.update_layout(
    title="3-Month Sales Forecast",
    xaxis_title="Month-Year",
    yaxis_title="Total Sales (Value)",
    xaxis=dict(tickmode='array', tickvals=monthly_sales.index.strftime('%B %Y')),
    template="plotly_white"
)

# Show the plot
fig.show()
c:\Program Files\Python312\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning:

No frequency information was provided, so inferred frequency MS will be used.

c:\Program Files\Python312\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning:

No frequency information was provided, so inferred frequency MS will be used.

c:\Program Files\Python312\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning:

No frequency information was provided, so inferred frequency MS will be used.

c:\Program Files\Python312\Lib\site-packages\statsmodels\tsa\statespace\sarimax.py:966: UserWarning:

Non-stationary starting autoregressive parameters found. Using zeros as starting parameters.

c:\Program Files\Python312\Lib\site-packages\statsmodels\tsa\statespace\sarimax.py:978: UserWarning:

Non-invertible starting MA parameters found. Using zeros as starting parameters.

C:\Users\Deninjo\AppData\Local\Temp\ipykernel_10284\468917369.py:18: FutureWarning:

'M' is deprecated and will be removed in a future version, please use 'ME' instead.

Forecasted Sales (Red Dashed Line)

The forecast starts high in January 2025 and shows a gradual decline through March 2025. This suggests that the model predicts a downward trend in sales over the forecast period.

Anomaly Detection¶

In [153]:
# Identify any unusual spikes or drops in sales performance (Quantity or Value)
monthly_sum_sales = df4.groupby('month-year').agg({'quantity': 'sum', 'value': 'sum'}).reset_index()

# Convert 'Month-Year' to a string for plotting purposes
monthly_sum_sales['month-year'] = monthly_sum_sales['month-year'].astype(str)

# Group the data by 'Month-Year' and calculate the monthly averages
monthly_sum_sales = df4.groupby('month-year').agg({'quantity': 'sum', 'value': 'sum'}).reset_index()

# Convert 'Month-Year' to a string for plotting
monthly_sum_sales['month-year'] = monthly_sum_sales['month-year'].astype(str)

# Create a line plot with dual y-axes
fig = go.Figure()

# Add the line for Value
fig.add_trace(go.Scatter(
    x=monthly_sum_sales['month-year'], 
    y=monthly_sum_sales['value'], 
    name='Total Sum Value', 
    line=dict(color='red'),
    yaxis='y1'
))

# Add the line for Quantity
fig.add_trace(go.Scatter(
    x=monthly_sum_sales['month-year'], 
    y=monthly_sum_sales['quantity'], 
    name='Total Sum Quantity', 
    line=dict(color='green'),
    yaxis='y2'
))

# Update layout for dual y-axes
fig.update_layout(
    title="Monthly Sum of Sales (Quantity and Value)",
    xaxis=dict(title="Month-Year"),
    yaxis=dict(
        title="Total Value",
        titlefont=dict(color="blue"),
        tickfont=dict(color="blue"),
    ),
    yaxis2=dict(
        title="Total Quantity",
        titlefont=dict(color="green"),
        tickfont=dict(color="green"),
        anchor="x",
        overlaying="y",
        side="right",
    ),
    legend=dict(x=0.5, y=1.1, orientation="h"),
)

# Show the plot
fig.show()

From the graph, here are the notable spikes and drops in sales performance:

  1. January 2024 (Spike in Total Sum Value and Quantity)
  • Sales value peaked at around 190M, and quantity was also relatively high.
  • Possible reason: Seasonal demand, holiday sales, or a major promotional event.
  1. February to March 2024 (Steep Drop in Both Value and Quantity)
  • A sharp decline in both total value and total quantity, reaching the lowest point in March.
  • Possible reason: Post-holiday season slowdown, reduced consumer spending, or stock shortages.
  1. May 2024 (Spike in Both Value and Quantity)
  • Significant recovery from March, with a sharp increase in sales.
  • Possible reason: A new product launch, marketing campaign, or seasonal sales trend.
  1. July 2024 (Another Peak in Value and Quantity)
  • Both metrics reached one of their highest points in the year.
  • Possible reason: Mid-year sales event, discounts, or increased demand in specific product categories.
  1. November 2024 (Sharp Drop in Value, Moderate Drop in Quantity)
  • The quantity remains relatively high compared to previous dips, but the value drops significantly.
  • Possible reason: Discounted sales leading to higher volumes but lower revenue per unit.

Correlation Analysis¶

In [ ]:
df4
Out[ ]:
month-year-formatted anonymized_category anonymized_product anonymized_business anonymized_location quantity unit_price value month-year
0 August 2024 category-106 product-21f4 business-de42 location-1ba8 1 850.0 850.0 2024-08-01
1 August 2024 category-120 product-4156 business-de42 location-1ba8 2 1910.0 3820.0 2024-08-01
2 August 2024 category-121 product-49bd business-de42 location-1ba8 1 3670.0 3670.0 2024-08-01
3 August 2024 category-76 product-61dd business-de42 location-1ba8 1 2605.0 2605.0 2024-08-01
4 August 2024 category-119 product-66e0 business-de42 location-1ba8 5 1480.0 7400.0 2024-08-01
... ... ... ... ... ... ... ... ... ...
333398 November 2024 category-121 product-898d business-0e99 location-689f 5 1360.0 6800.0 2024-11-01
333401 August 2024 category-106 product-21f4 business-72bc location-689f 1 850.0 850.0 2024-08-01
333402 August 2024 category-76 product-6e9c business-72bc location-689f 1 2575.0 2575.0 2024-08-01
333403 October 2024 category-91 product-523e business-2693 location-b27b 1 1520.0 1520.0 2024-10-01
333404 October 2024 category-75 product-b31e business-2693 location-b27b 1 4420.0 4420.0 2024-10-01

329873 rows × 9 columns

In [155]:
# correlation matrix between nmerical columns
X = df4[[ 'quantity', 'unit_price','value']]

# Calculate the correlation matrix
correlation_matrix = X.corr()

# Create a heatmap using seaborn
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", cbar=True)

# Add a title
plt.title("Correlation Heatmap of Selected Features", fontsize=16)

# Show the heatmap
plt.show()
No description has been provided for this image

Sales Performance is Primarily Driven by Volume: The strong correlation between quantity and value suggests selling more units is the best way to increase revenue.

Pricing has a Limited Effect on Revenue: Since unit price has a weak correlation with total value, raising prices may not significantly boost overall sales revenue.

Marketing Strategies Should Focus on Increasing Quantity Sold:

  • Offering bulk discounts or promotions could drive more sales.
  • Targeting high-demand products with aggressive sales campaigns may be more effective than price increases.

Strategic Insights and Recommendations¶

Product Strategy¶
In [ ]:
# Group by Anonymized Category and calculate required metrics
category_segmentation = df4.groupby('anonymized_category').agg(
    total_quantity=('quantity', 'sum'),
    total_value=('value', 'sum'),
    transaction_frequency=('anonymized_category', 'count')
).reset_index()

# Sorting for analysis (optional)
top_categories = category_segmentation.sort_values(by='total_value', ascending=False)
top_categories.head()
Out[ ]:
anonymized_category total_quantity total_value transaction_frequency
25 category-75 151330 544658700.0 47769
26 category-76 71719 344939553.0 41473
18 category-120 169715 319178743.0 61076
0 category-100 76824 134902751.0 18720
17 category-119 68332 103454819.0 31634
27 category-77 28455 76741382.0 13586
38 category-91 20853 44152103.0 15891
1 category-101 19564 35577822.0 6636
34 category-85 22997 33762533.0 13015
19 category-121 14669 22327643.0 7281
40 category-94 23668 16750815.0 10859
39 category-92 6953 10468723.0 4977
28 category-78 9766 9792609.0 6689
41 category-95 4116 7466932.0 3106
33 category-84 11933 6798158.0 8628
5 category-106 6521 5932763.0 4919
9 category-110 10529 5483386.0 6064
7 category-108 9756 5101375.0 2534
10 category-111 6715 4387343.0 5090
6 category-107 2729 4170797.0 2107
32 category-83 2436 4039483.0 1849
31 category-82 4759 3930818.0 2503
20 category-122 1223 3493480.0 567
4 category-105 1579 2690719.0 1186
43 category-97 2711 2628309.0 1597
44 category-98 2152 2519695.0 1569
42 category-96 1427 2249424.0 1041
24 category-74 941 1927871.0 158
45 category-99 1964 1589480.0 1336
3 category-104 1217 1557598.0 999
8 category-109 1446 1263226.0 947
11 category-113 741 1254083.0 308
29 category-79 2215 1184953.0 1313
21 category-123 286 730730.0 177
2 category-102 1786 464463.0 1036
13 category-115 348 425360.0 242
14 category-116 856 422745.0 618
23 category-125 123 297060.0 63
36 category-89 238 136850.0 176
30 category-81 142 72061.0 87
37 category-90 15 15750.0 15
22 category-124 4 10060.0 4
12 category-114 3 8600.0 3
16 category-118 21 7560.0 14
35 category-86 8 3320.0 7
15 category-117 5 1550.0 4
In [ ]:
# scatter plot to analyze relationships between total quantity, total value, and transaction frequency
fig = px.scatter(
    category_segmentation,
    x='total_quantity',
    y='total_value',
    size='transaction_frequency',
    color='transaction_frequency',
    title='Category Segmentation Based on Purchasing Behavior',
    labels={
        'total_quantity': 'Total Quantity Purchased',
        'total_value': 'Total Value Contributed ($)',
        'transaction_frequency': 'Transaction Frequency'
    },
    hover_data=['anonymized_category']
)

# Show the plot
fig.show()

Based on my analysis, I would recommend product category 'category-75' to prioritize for marketing campaigns as it has the highest total value contributed at around 544.6M with 151.3k total quantities purchased

Customer Retention¶
In [159]:
#  Identify businesses that have reduced their purchase frequency over time

# Convert to datetime if not already
df4['month-year-formatted'] = pd.to_datetime(df4['month-year-formatted'])

# Format month-year as "Month Year" (e.g., "August 2024")
df4['month-year-formatted'] = df4['month-year-formatted'].dt.strftime('%B %Y')

# Group by business and formatted month-year, count purchases in each month
purchase_trend = df4.groupby(['anonymized_business', 'month-year-formatted']).size().reset_index(name='purchase_count')

# Calculate the trend in purchase frequency per business
purchase_trend['prev_purchase_count'] = purchase_trend.groupby('anonymized_business')['purchase_count'].shift(1)
purchase_trend['change'] = purchase_trend['purchase_count'] - purchase_trend['prev_purchase_count']

# Identify businesses with a declining trend
declining_customers = purchase_trend[purchase_trend['change'] < 0]

declining_customers
C:\Users\Deninjo\AppData\Local\Temp\ipykernel_10284\855654940.py:2: UserWarning:

Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.

Out[159]:
anonymized_business month-year-formatted purchase_count prev_purchase_count change
3 business-0000 November 2024 1 2.0 -1.0
11 business-0072 December 2024 10 23.0 -13.0
13 business-0072 June 2024 8 14.0 -6.0
14 business-0072 May 2024 3 8.0 -5.0
17 business-0072 September 2024 10 22.0 -12.0
... ... ... ... ... ...
20675 business-ffd2 June 2024 3 8.0 -5.0
20676 business-ffd2 March 2024 2 3.0 -1.0
20678 business-ffd2 September 2024 4 5.0 -1.0
20681 business-ffff October 2024 25 48.0 -23.0
20682 business-ffff September 2024 18 25.0 -7.0

7267 rows × 5 columns

Since the dataset on declining customers is huge, lets try to get some key insights from the dataset

In [160]:
# Summary Statistics on Purchase Change
declining_customers['change'].describe()
Out[160]:
count    7267.000000
mean      -10.530893
std        13.789538
min      -310.000000
25%       -13.000000
50%        -6.000000
75%        -3.000000
max        -1.000000
Name: change, dtype: float64
In [ ]:
# Businesses with the Biggest Declines
top_decliners = declining_customers.nsmallest(25, 'change')
top_decliners
Out[ ]:
anonymized_business month-year-formatted purchase_count prev_purchase_count change
12637 business-978e February 2024 115 425.0 -310.0
17010 business-cf0d November 2024 26 313.0 -287.0
16745 business-cb1f September 2024 219 451.0 -232.0
2941 business-245e July 2024 25 196.0 -171.0
10187 business-78a8 November 2024 6 160.0 -154.0
9081 business-6baf January 2024 2 155.0 -153.0
8047 business-6011 September 2024 18 164.0 -146.0
12645 business-978e September 2024 278 415.0 -137.0
3006 business-24c3 September 2024 37 171.0 -134.0
4763 business-3955 July 2024 27 149.0 -122.0
13826 business-a8bd March 2024 23 141.0 -118.0
525 business-07de December 2024 6 114.0 -108.0
534 business-07de September 2024 111 216.0 -105.0
7188 business-5677 September 2024 106 210.0 -104.0
3147 business-263c December 2024 1 104.0 -103.0
10341 business-7ab4 November 2024 24 125.0 -101.0
15483 business-bc52 September 2024 34 129.0 -95.0
5891 business-468e November 2024 59 151.0 -92.0
10271 business-7a03 February 2024 18 110.0 -92.0
18098 business-de42 March 2024 17 107.0 -90.0
13421 business-a3fe September 2024 47 136.0 -89.0
18907 business-e997 December 2024 43 130.0 -87.0
2525 business-1f55 November 2024 17 103.0 -86.0
8076 business-6068 March 2024 47 133.0 -86.0
2713 business-2197 February 2024 7 91.0 -84.0
In [165]:
# Trend Over Time (Monthly Summary)

# Ensure 'month-year-formatted' is in datetime format
declining_customers['month-year-formatted'] = pd.to_datetime(declining_customers['month-year-formatted'], format='%B %Y')

# Group by month-year and sum changes
monthly_decline = declining_customers.groupby('month-year-formatted')['change'].sum()

# Sort by date (ensures chronological order)
monthly_decline = monthly_decline.sort_index(ascending=True)

# Convert back to string format for proper labeling
monthly_decline.index = monthly_decline.index.strftime('%B %Y')

# Create an interactive bar chart using Plotly
fig = px.bar(
    x=monthly_decline.index,
    y=monthly_decline.values,
    labels={'x': 'Month-Year', 'y': 'Total Decline in Purchases'},
    title="Total Monthly Decline in Purchases",

)

# Improve layout
fig.update_layout(
    xaxis_title="Month-Year",
    yaxis_title="Total Decline",
    xaxis=dict(tickangle=-45),  # Rotate x-axis labels for better readability
    bargap=0.3  # Adjust bar spacing
)

# Show the interactive graph
fig.show()
C:\Users\Deninjo\AppData\Local\Temp\ipykernel_10284\4292986156.py:4: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Re-engagement Strategies

From the top 25 identified businesses with declining purchases, you can implement the following targeted strategies:

📌 Personalized Email & Promotions

  • Offer discounts or special deals for returning customers.
  • Send a personalized reminder about their past purchases and recommend new products.

📌 Loyalty Programs & Incentives

  • Create a points-based loyalty system for frequent buyers.
  • Offer exclusive rewards to customers who return after a long time.

📌 Customer Feedback & Surveys

  • Send a short survey asking why they’ve reduced their purchases.
  • Offer an incentive (e.g., a discount) for completing the survey.

📌 Targeted Ads & Retargeting

  • Use Facebook, Google Ads, or LinkedIn to show personalized ads to customers who haven’t purchased recently.
  • Implement email reminders with abandoned cart recovery tactics.

Open-Ended Problem¶

If the dataset were 10 times larger, scalability would become a key concern for data storage, processing, and analysis.

Here’s how I would optimize each aspect:

  1. Data Storage Optimization
    Database Choices
  • Switch to a Columnar Database: If the data is primarily for analytical purposes, use columnar databases like Amazon Redshift, ClickHouse, or Google BigQuery. These are optimized for aggregations and queries.

    Partitioning & Indexing:
  • Partitioning: Split data by time (e.g., monthly or yearly) or category (e.g., region, product type). This speeds up queries.

    Move to Distributed Storage if Needed
  • Cloud Storage Solutions (e.g., AWS S3, Google Cloud Storage, Azure Blob) → Store raw data in a data lake for scalable access.
  • Distributed Databases (e.g., Amazon Redshift, Snowflake, Apache Cassandra) → If frequent querying is needed, a distributed SQL-based system can parallelize queries efficiently.


  1. Data Processing Optimization
    Batch vs. Streaming Processing
  • Batch Processing (ETL/ELT) → Use Apache Spark, Dask, or Polars instead of Pandas for handling large datasets.
  • Streaming Processing (If Real-Time Forecasting is Needed) → Use Kafka + Spark Streaming or Flink for real-time sales data updates.

    Parallel & Distributed Computing
  • Leverage Multi-Core Processing: Use Dask, Vaex, or Modin instead of pandas for handling large DataFrames.
  • Distributed Computing Frameworks: Apache Spark (PySpark) for large-scale processing. Google BigQuery or Snowflake for cloud-based distributed queries.


  1. Data Analysis Optimization
    Efficient Aggregations & Queries
  • Precompute Aggregates: Store summary tables (e.g., sales per region/month) to avoid recalculating frequently.
  • Materialized Views: Create cached views of common queries to speed up performance.

    Hardware & Cloud Scaling
  • Scale-Up vs. Scale-Out: Vertical Scaling (More RAM, SSDs) for faster processing. Horizontal Scaling (Distributed clusters, parallel processing) for massive datasets.


  1. Forecasting Model Optimization
    Efficient Machine Learning Approaches
  • Use gradient boosting (e.g., XGBoost, LightGBM) or deep learning (e.g., LSTMs, Transformers) for better time-series forecasting.
  • If real-time updates are needed, use incremental learning models (e.g., River, online learning algorithms).

Final Thoughts

If the dataset grows 100 times larger, transitioning to cloud-based solutions (AWS, GCP, Azure) with serverless data warehouses (BigQuery, Snowflake) would be the best approach.